Mining Naturally-occurring Corrections and Paraphrases from Wikipedia's Revision History

Authors

  • Aurélien Max
  • Guillaume Wisniewski
Abstract

Naturally-occurring instances of linguistic phenomena are important both for training and for evaluating automatic text processing. When available in large quantities, they also provide interesting material for linguistic studies. In this article, we present WiCoPaCo (Wikipedia Correction and Paraphrase Corpus), a new freely-available resource built by automatically mining Wikipedia’s revision history. The WiCoPaCo corpus focuses on local modifications made by human revisors and includes various types of corrections (such as spelling or typographical corrections) and rewritings, which can be broadly categorized into meaning-preserving and meaning-altering revisions. We present an initial hand-built typology of these revisions, but the resource allows for any possible annotation scheme. We discuss the main motivations for building such a resource and describe the main technical details guiding its construction. We also present applications and data analysis on French and report initial results on spelling error correction and morphosyntactic rewriting. The WiCoPaCo corpus can be freely downloaded from http://wicopaco.limsi.fr.
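
To make the idea of a "local modification" concrete, the following is a minimal sketch of how short word-level rewritings could be extracted from two consecutive revisions of a text. It is not the authors' pipeline: the function name `local_modifications`, the whitespace tokenization, and the `max_words` locality threshold are all assumptions made for illustration, and the alignment relies on Python's standard `difflib` module rather than whatever method was used to build WiCoPaCo.

```python
import difflib


def local_modifications(old_text, new_text, max_words=5):
    """Return (before, after) word spans that differ between two revisions.

    Hypothetical helper: max_words is an illustrative locality threshold,
    not a value taken from the paper.
    """
    old_words = old_text.split()
    new_words = new_text.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Keep only replacements short enough to count as "local".
        if tag == "replace" and (i2 - i1) <= max_words and (j2 - j1) <= max_words:
            pairs.append((" ".join(old_words[i1:i2]), " ".join(new_words[j1:j2])))
    return pairs


if __name__ == "__main__":
    before = "The corpus include various types of corrections"
    after = "The corpus includes various types of corrections"
    print(local_modifications(before, after))
```

Each resulting pair could then be classified by hand or automatically, for example as a spelling correction or as a meaning-preserving rewriting, under whatever annotation scheme is chosen.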

Similar resources

External Plagiarism Detection based on Human Behaviors in Producing Paraphrases of Sentences in English and Persian Languages

With the advent of the internet and easy access to digital libraries, plagiarism has become a major issue. Applying search engines is one of the plagiarism detection techniques that converts plagiarism patterns to search queries. Generating suitable queries is the heart of this technique and existing methods suffer from lack of producing accurate queries, Precision and Speed of retrieved result...

Squibs: What Is a Paraphrase?

Paraphrases are sentences or phrases that convey the same meaning using different wording. Although the logical definition of paraphrases requires strict semantic equivalence, linguistics accepts a broader, approximate, equivalence—thereby allowing far more examples of “quasiparaphrase.” But approximate equivalence is hard to define. Thus, the phenomenon of paraphrases, as understood in linguis...

Validation sur le Web de reformulations locales: application à la Wikipédia (Assisted Rephrasing for Wikipedia Contributors through Web-based Validation) [in French]

This work describes initial experiments on the validation of paraphrases in context. Wikipedia’s revisions are used: we assume that a set of possible rewritings is available for a given phrase that has been rewritten in the encyclopedia’s revision history, and we attempt to find the subset of those rewritings that ca...

Wikipedia Revision Toolkit: Efficiently Accessing Wikipedia's Edit History

We present an open-source toolkit which allows (i) to reconstruct past states of Wikipedia, and (ii) to efficiently access the edit history of Wikipedia articles. Reconstructing past states of Wikipedia is a prerequisite for reproducing previous experimental work based on Wikipedia. Beyond that, the edit history of Wikipedia articles has been shown to be a valuable knowledge source for NLP, but...

Web-based Validation for Contextual Targeted Paraphrasing

In this work, we present a scenario where contextual targeted paraphrasing of sub-sentential phrases is performed automatically to support the task of text revision. Candidate paraphrases are obtained from a preexisting repertoire and validated in the context of the original sentence using information derived from the Web. We report on experiments on French, where the original sentences to be r...

Publication date: 2010